[SYCL] Fix batched impl for NVidia GPU #6164
Conversation
Thanks for the fix. LGTM!
This change reverts a code change of mine: #5895. I think this PR will reduce performance on Intel GPUs.
That PR mixes fp32 and fp16 to increase performance in fp32 mode. It reduces accuracy because fp16 is involved.
Yes, this will decrease performance for iGPUs. However, open-source oneMKL does not support the mixed datatypes for this batched GEMM call. I propose taking the less optimal but more generic path to temporarily maintain support for all targets and keep the relevant tests passing. Once oneMKL is updated, we can revert this change and get the benefits you are describing.
Great!
I think we should define a good framework/rule to handle this case: different optimizations for different hardware architectures. Here is my proposal:
For alternative 2, inviting @ggerganov @slaren to comment.
Just some thoughts: my goal in the long term is to move backends to dynamic libraries that can be loaded at runtime. Then it will be possible to use multiple backends simultaneously, for example CUDA for an NVIDIA device and SYCL for an Intel device at the same time. So there is no need for the SYCL backend to support every vendor if there is already a backend available for that vendor that may provide better performance. It would of course be good for the SYCL backend to work with every device just to have more options, but the effort may be better spent improving performance with Intel devices.
Yes, I think the long-term idea is good. @AidanBeltonS maybe you could use alternative 1 to handle your code for NV/AMD.
I think simple ifdef macro branching can be quite useful here (a sketch follows below), keeping the rest of the code performant for iGPUs.
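For illustration, a minimal sketch of what such compile-time branching could look like; `GGML_SYCL_TARGET_NVIDIA` and the two helper functions are hypothetical placeholder names, not identifiers taken from the codebase.

```cpp
// Hypothetical compile-time dispatch between the two GEMM paths.
void mul_mat_batched_all_f16();        // generic all-fp16 gemm_batch path (placeholder)
void mul_mat_batched_mixed_f16_f32();  // previous mixed fp16/fp32 path (placeholder)

void ggml_sycl_mul_mat_batched() {
#if defined(GGML_SYCL_TARGET_NVIDIA)
    // Open-source oneMKL lacks mixed fp16/fp32 batched GEMM, so compute
    // everything in halves when building for NVIDIA targets.
    mul_mat_batched_all_f16();
#else
    // Keep the faster mixed fp16/fp32 behaviour on Intel iGPUs.
    mul_mat_batched_mixed_f16_f32();
#endif
}
```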
Mixing fp32 and fp16 still has an accuracy-loss issue.
Force-pushed from 421da29 to d1cb8fe (compare)
I've protected the previous approach using a backend check at runtime. I think this is cleaner than using a macro, and checking the backend is a cheap operation, so there should be no noticeable drop in performance. It has been checked on Intel and NVidia GPUs.
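For illustration, a minimal sketch of what such a runtime backend check could look like using DPC++'s device backend query; the function names and dispatch structure here are assumptions, not the actual ggml-sycl code.

```cpp
#include <sycl/sycl.hpp>

// Returns true when the queue's device is driven by the CUDA plugin
// (i.e. an NVIDIA GPU); this query is cheap enough to run per call.
static bool device_is_cuda(const sycl::queue & q) {
    return q.get_device().get_backend() == sycl::backend::ext_oneapi_cuda;
}

// Hypothetical dispatch: take the generic all-fp16 path on CUDA devices,
// keep the previous mixed fp16/fp32 path everywhere else (Intel iGPUs).
static void mul_mat_batched(sycl::queue & q /*, ... tensor arguments ... */) {
    if (device_is_cuda(q)) {
        // all-fp16 gemm_batch path (supported by open-source oneMKL)
    } else {
        // previous mixed fp16/fp32 path, kept for Intel GPU performance
    }
}
```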
All tests are passing with this approach, so any possible issues with precision do not seem to be significant.
Can be merged. CI failure not related to SYCL. cc @ggerganov
* Fix batched impl
* Maintain previous behaviour for igpu
* retrigger CI

Co-authored-by: Abhilash Majumder <[email protected]>
This PR fixes the failing batched_gemm call. Open-source oneMKL does not currently support mixing half and float data types.
This change uses all halves for the computation, which resolves the failing tests in the NVidia build.
I am currently in the process of adding this mixed-dtype feature to oneMKL, after which this change can be reverted.
Tested on A100 and PVC
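For context, a minimal sketch of an all-half strided gemm_batch call with the oneMKL USM API; the wrapper function, shapes, and strides are illustrative assumptions, not the exact code from this PR.

```cpp
#include <cstdint>
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

// All operands and the alpha/beta scalars are sycl::half; mixing half inputs
// with float alpha/beta/output is what currently fails in the NVidia build.
void batched_matmul_fp16(sycl::queue & q,
                         const sycl::half * a, const sycl::half * b, sycl::half * c,
                         std::int64_t m, std::int64_t n, std::int64_t k, std::int64_t batch) {
    const sycl::half alpha = 1.0f;
    const sycl::half beta  = 0.0f;
    // Column-major strided batch: A is m x k, B is k x n, C is m x n per batch.
    oneapi::mkl::blas::column_major::gemm_batch(
        q,
        oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
        m, n, k,
        alpha,
        a, /*lda=*/m, /*stride_a=*/m * k,
        b, /*ldb=*/k, /*stride_b=*/k * n,
        beta,
        c, /*ldc=*/m, /*stride_c=*/m * n,
        batch);
}
```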